This report explores a data set containing financial contributions made by California residents to Presidential candidates in the 2016 Presidential election.

Univariate Plots Section

## 'data.frame':    1125659 obs. of  19 variables:
##  $ cmte_id          : Factor w/ 25 levels "C00458844","C00500587",..: 6 6 6 7 7 7 7 6 7 7 ...
##  $ cand_id          : Factor w/ 25 levels "P00003392","P20002671",..: 1 1 1 12 12 12 12 1 12 12 ...
##  $ cand_nm          : Factor w/ 25 levels "Bush, Jeb","Carson, Benjamin S.",..: 4 4 4 20 20 20 20 4 20 20 ...
##  $ contbr_nm        : Factor w/ 195943 levels "0, J.","AAGAARD, DAVID",..: 6546 26587 59263 99779 101025 101025 101084 78017 101106 101131 ...
##  $ contbr_city      : Factor w/ 2118 levels "",".","1000 OAKS",..: 937 256 603 255 1503 1503 2003 887 2047 1371 ...
##  $ contbr_st        : Factor w/ 1 level "CA": 1 1 1 1 1 1 1 1 1 1 ...
##  $ contbr_zip       : Factor w/ 134034 levels "","00000","000090272",..: 107295 66913 50011 61864 16081 16081 42612 54660 57711 108598 ...
##  $ contbr_employer  : Factor w/ 57983 levels "","-","--","---",..: 34229 34229 34229 4075 54897 54897 36555 34229 35407 44971 ...
##  $ contbr_occupation: Factor w/ 25386 levels "","-","--","---",..: 18976 18976 18976 21213 15896 15896 17393 18976 14740 6604 ...
##  $ contb_receipt_amt: num  50 200 5 40 35 100 25 40 10 15 ...
##  $ contb_receipt_dt : Factor w/ 659 levels "01-APR-15","01-APR-16",..: 540 408 25 78 99 121 78 408 99 121 ...
##  $ receipt_desc     : Factor w/ 74 levels "","2016 SENATE PRIMARY DONOR REDESIGNATION FROM PRIMARY",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ memo_cd          : Factor w/ 2 levels "","X": 2 2 2 1 1 1 1 2 1 1 ...
##  $ memo_text        : Factor w/ 423 levels "","*","$0.02 REFUNDED ON 10/21/2016",..: 198 198 198 151 151 151 151 198 151 151 ...
##  $ form_tp          : Factor w/ 3 levels "SA17A","SA18",..: 2 2 2 1 1 1 1 2 1 1 ...
##  $ file_num         : int  1091718 1091718 1091718 1077404 1077404 1077404 1077404 1091718 1077404 1077404 ...
##  $ tran_id          : Factor w/ 1122205 levels "A000771210424405B8CF",..: 327338 326620 324002 858677 860121 862422 858139 326658 860117 863334 ...
##  $ election_tp      : Factor w/ 4 levels "","G2016","P2016",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ X                : logi  NA NA NA NA NA NA ...
##       cmte_id            cand_id                            cand_nm      
##  C00575795:547211   P00003392:547211   Clinton, Hillary Rodham  :547211  
##  C00577130:407172   P60007168:407172   Sanders, Bernard         :407172  
##  C00574624: 57820   P60006111: 57820   Cruz, Rafael Edward 'Ted': 57820  
##  C00580100: 50168   P80001571: 50168   Trump, Donald J.         : 50168  
##  C00573519: 27362   P60005915: 27362   Carson, Benjamin S.      : 27362  
##  C00458844: 14092   P60006723: 14092   Rubio, Marco             : 14092  
##  (Other)  : 21834   (Other)  : 21834   (Other)                  : 21834  
##               contbr_nm              contbr_city     contbr_st   
##  MITCHELL, MARCIA  :    388   LOS ANGELES  : 88041   CA:1125659  
##  PETIT, MICHAEL    :    352   SAN FRANCISCO: 78340               
##  CARROLL, TERI     :    333   SAN DIEGO    : 39967               
##  SAMATUA, DENISE   :    332   OAKLAND      : 28998               
##  SMITH, CHERYL     :    324   SAN JOSE     : 26420               
##  MONTANELLI, TERESA:    295   SACRAMENTO   : 20651               
##  (Other)           :1123635   (Other)      :843242               
##      contbr_zip           contbr_employer               contbr_occupation 
##  92660    :    449   N/A          :148625   RETIRED              :215855  
##  92037    :    407   RETIRED      :103846   NOT EMPLOYED         :113153  
##  900363146:    388   SELF-EMPLOYED: 94783   ATTORNEY             : 30572  
##  911075001:    352   NONE         : 84250   TEACHER              : 25486  
##  926372766:    335   NOT EMPLOYED : 49114   INFORMATION REQUESTED: 20936  
##  932631317:    333   (Other)      :644474   (Other)              :719524  
##  (Other)  :1123395   NA's         :   567   NA's                 :   133  
##  contb_receipt_amt   contb_receipt_dt  
##  Min.   :-10500.0   29-FEB-16:  11735  
##  1st Qu.:    15.0   31-MAR-16:  11506  
##  Median :    27.0   31-MAY-16:  10435  
##  Mean   :   121.8   30-APR-16:   9479  
##  3rd Qu.:    97.0   26-SEP-16:   9237  
##  Max.   : 10800.0   08-JUN-16:   8901  
##                     (Other)  :1064366  
##                                   receipt_desc     memo_cd   
##                                         :1110614    :981391  
##  Refund                                 :   8568   X:144268  
##  REDESIGNATION FROM PRIMARY             :   1324             
##  REDESIGNATION TO GENERAL               :   1324             
##  REATTRIBUTION / REDESIGNATION REQUESTED:    569             
##  REDESIGNATION TO CRUZ FOR SENATE       :    544             
##  (Other)                                :   2716             
##                                memo_text       form_tp      
##                                     :624511   SA17A:979490  
##  * EARMARKED CONTRIBUTION: SEE BELOW:390588   SA18 :137601  
##  * HILLARY VICTORY FUND             :100319   SB28A:  8568  
##  REDESIGNATION FROM PRIMARY         :  1324                 
##  REDESIGNATION TO GENERAL           :  1324                 
##  *BEST EFFORTS UPDATE               :  1075                 
##  (Other)                            :  6518                 
##     file_num                       tran_id        election_tp   
##  Min.   :1003942   A5602AD777C8C4632B5A:      4        :  1425  
##  1st Qu.:1077665   ADB49CB248C174E298F0:      4   G2016:313746  
##  Median :1091720   A26C35A6066754130B99:      3   P2016:810481  
##  Mean   :1090286   A340DF85B7F884133A20:      3   P2020:     7  
##  3rd Qu.:1104813   A4E50E2DD07E4475996F:      3                 
##  Max.   :1119833   A7C22FA389E0348F98F0:      3                 
##                    (Other)             :1125639                 
##     X          
##  Mode:logical  
##  NA's:1125659  
##                
##                
##                
##                
## 

The data set contains 1,125,659 donations and 19 variables. After looking at the variables, I decided that most of them were not necessary for my analysis, so I removed them from the data frame before continuing with my plots. The resulting data frame has the same number of observations but only 4 variables.
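As a minimal sketch of this step, the columns can be selected by name in base R. The data frame name `PCF` appears in the output later in this report; the toy rows and the name `PCF_small` are assumptions made only for illustration.

```r
# Toy stand-in for the full contributions data frame (made-up rows).
PCF <- data.frame(
  cand_nm           = c("Sanders, Bernard", "Clinton, Hillary Rodham"),
  contbr_city       = c("OAKLAND", "LOS ANGELES"),
  contbr_st         = c("CA", "CA"),
  contbr_occupation = c("TEACHER", "RETIRED"),
  contb_receipt_amt = c(27, 50),
  memo_cd           = c("", "X"),
  stringsAsFactors  = TRUE
)

# The four variables kept for the rest of the analysis.
keep_cols <- c("cand_nm", "contbr_city", "contbr_occupation", "contb_receipt_amt")

# Selecting by name means the result does not depend on column order.
PCF_small <- PCF[, keep_cols]

str(PCF_small)
```

Selecting by name rather than by position keeps the step robust if the source file ever gains or reorders columns.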

## 'data.frame':    1125659 obs. of  4 variables:
##  $ cand_nm          : Factor w/ 25 levels "Bush, Jeb","Carson, Benjamin S.",..: 4 4 4 20 20 20 20 4 20 20 ...
##  $ contbr_city      : Factor w/ 2118 levels "",".","1000 OAKS",..: 937 256 603 255 1503 1503 2003 887 2047 1371 ...
##  $ contbr_occupation: Factor w/ 25386 levels "","-","--","---",..: 18976 18976 18976 21213 15896 15896 17393 18976 14740 6604 ...
##  $ contb_receipt_amt: num  50 200 5 40 35 100 25 40 10 15 ...
##                       cand_nm              contbr_city    
##  Clinton, Hillary Rodham  :547211   LOS ANGELES  : 88041  
##  Sanders, Bernard         :407172   SAN FRANCISCO: 78340  
##  Cruz, Rafael Edward 'Ted': 57820   SAN DIEGO    : 39967  
##  Trump, Donald J.         : 50168   OAKLAND      : 28998  
##  Carson, Benjamin S.      : 27362   SAN JOSE     : 26420  
##  Rubio, Marco             : 14092   SACRAMENTO   : 20651  
##  (Other)                  : 21834   (Other)      :843242  
##              contbr_occupation  contb_receipt_amt 
##  RETIRED              :215855   Min.   :-10500.0  
##  NOT EMPLOYED         :113153   1st Qu.:    15.0  
##  ATTORNEY             : 30572   Median :    27.0  
##  TEACHER              : 25486   Mean   :   121.8  
##  INFORMATION REQUESTED: 20936   3rd Qu.:    97.0  
##  (Other)              :719524   Max.   : 10800.0  
##  NA's                 :   133

First, I looked at the candidates and how many donations each one received.

cand_nm count sum median percent_count percent_sum
Bush, Jeb 3130 3300291.83 500 0.28 2.41
Carson, Benjamin S. 27362 2924593.00 50 2.43 2.13
Christie, Christopher J. 333 456066.00 1000 0.03 0.33
Clinton, Hillary Rodham 547211 83781357.32 25 48.61 61.10
Cruz, Rafael Edward ‘Ted’ 57820 5735382.27 50 5.14 4.18
Fiorina, Carly 4696 1468489.42 100 0.42 1.07
Gilmore, James S III 3 8100.00 2700 0.00 0.01
Graham, Lindsey O. 347 414495.00 1000 0.03 0.30
Huckabee, Mike 531 230890.60 50 0.05 0.17
Jindal, Bobby 31 23231.26 250 0.00 0.02

cand_nm percent_count
Clinton, Hillary Rodham 48.61
Sanders, Bernard 36.17
Cruz, Rafael Edward ‘Ted’ 5.14
Trump, Donald J. 4.46
Carson, Benjamin S. 2.43
Rubio, Marco 1.25
Fiorina, Carly 0.42
Paul, Rand 0.38
Bush, Jeb 0.28
Kasich, John R. 0.27

As can be seen above, Hillary Clinton and Bernie Sanders received the most donations by far (getting about 49% and 36% of the total donations respectively). This doesn’t tell us anything about the kind of donations they are receiving, so I will look at that again later in the Bivariate Analysis.
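A table like the ones above (count, sum, median, and percentage shares per group) could be built along the following lines. This is only a sketch: the helper name `summarise_by` and the toy values are assumptions, not the code actually used for this report.

```r
# Hypothetical helper: one row per group with count, sum, median, and each
# group's percentage share of the overall count and of the overall sum.
summarise_by <- function(amount, group) {
  group <- droplevels(as.factor(group))
  out <- data.frame(
    group  = levels(group),
    count  = as.vector(tapply(amount, group, length)),
    sum    = as.vector(tapply(amount, group, sum)),
    median = as.vector(tapply(amount, group, median))
  )
  out$percent_count <- round(100 * out$count / sum(out$count), 2)
  out$percent_sum   <- round(100 * out$sum / sum(out$sum), 2)
  # Most frequently donated-to groups first.
  out[order(-out$count), ]
}

# Toy data (made-up values, not taken from the data set).
amt  <- c(25, 50, 100, 27, 27, 15, 2700)
cand <- c("Clinton", "Clinton", "Clinton", "Sanders", "Sanders", "Sanders", "Bush")

res <- summarise_by(amt, cand)
print(res)
```

Keeping both `percent_count` and `percent_sum` in one table makes it easy to spot candidates whose share of dollars differs sharply from their share of donations.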

Next, I wanted to look at counts for political parties, but since there wasn’t a variable for that, I had to make one. To do that, I wrote a function to match each candidate to their political party and then created the graph shown below.
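One way such a matching function might look is a named lookup vector with a default. The names `party_lookup` and `assign_party` are hypothetical, and treating every unlisted candidate as Republican is an assumption that happens to fit the 2016 field in this data set.

```r
# Assumed lookup: candidates not listed here default to "Republican".
party_lookup <- c(
  "Clinton, Hillary Rodham" = "Democrat",
  "Sanders, Bernard"        = "Democrat",
  "O'Malley, Martin Joseph" = "Democrat",
  "Lessig, Lawrence"        = "Democrat",
  "Webb, James Henry Jr."   = "Democrat",
  "Johnson, Gary"           = "Libertarian",
  "Stein, Jill"             = "Green",
  "McMullin, Evan"          = "Independent"
)

# Hypothetical helper: map candidate names to a party factor.
assign_party <- function(cand_nm) {
  party <- unname(party_lookup[as.character(cand_nm)])
  party[is.na(party)] <- "Republican"  # default for unlisted candidates
  factor(party,
         levels = c("Democrat", "Republican", "Libertarian", "Green", "Independent"))
}

parties <- assign_party(c("Sanders, Bernard", "Trump, Donald J.", "Stein, Jill"))
print(parties)
```

A lookup vector keeps the candidate-to-party mapping in one place, which is easier to audit than a chain of `if` statements.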

cand_party count sum median percent_count percent_sum
Democrat 955258 103970224.8 27 84.86 75.82
Republican 166747 32420371.5 50 14.81 23.64
Libertarian 1591 461430.6 100 0.14 0.34
Green 1907 245490.5 50 0.17 0.18
Independent 156 35135.5 100 0.01 0.03

cand_party percent_count
Democrat 84.86
Republican 14.81
Green 0.17
Libertarian 0.14
Independent 0.01

Even though I saw this coming from the graph above for candidates, Democrats received the most donations overall, taking in just under 85% of the total donations.

contb_receipt_amt count sum median percent_count percent_sum
-10500 1 -10500 -10500 0 -0.01
-10000 1 -10000 -10000 0 -0.01
-8460 1 -8460 -8460 0 -0.01
-8300 1 -8300 -8300 0 -0.01
-8100 1 -8100 -8100 0 -0.01
-5825 1 -5825 -5825 0 0.00

contb_receipt_amt percent_count
25 13.74
50 12.40
100 11.25
10 9.04
5 6.44
27 5.39
15 4.90
250 3.75
19 2.07
35 2.07

The top three donation amounts are $25, $50, and $100. Within the top 10, most were small donations of $100 or less, with the exception of $250. Also, most of these amounts are divisible by five, with the exceptions of $27 and $19. I wonder why those two amounts were so numerous.
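The two observations above (which amounts are most common, and how many are not divisible by five) can be checked with a few lines of base R. The vector `amt` below holds made-up values, so the numbers it produces say nothing about the real data.

```r
# Toy donation amounts (made-up values); the negative entry mimics a refund.
amt <- c(25, 25, 50, 27, 27, 27, 100, 19, -50)

# Frequency of each distinct amount, most common first.
counts <- sort(table(amt), decreasing = TRUE)
percent_count <- round(100 * counts / length(amt), 2)
print(head(percent_count, 3))

# Share of positive amounts that are not a multiple of 5.
pos <- amt[amt > 0]
share_not_div5 <- mean(pos %% 5 != 0)
print(share_not_div5)
```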

contbr_city count sum median lat lon percent_count percent_sum
25 5725.00 100.0 36.77826 -119.4179 0.00 0.00
. 1 2.40 2.4 36.77826 -119.4179 0.00 0.00
1000 OAKS 1 100.00 100.0 34.17056 -118.8376 0.00 0.00
29 PALMS 100 2990.52 27.0 34.13556 -116.0542 0.01 0.00
-4086 1 40.00 40.0 37.38580 -121.9731 0.00 0.00
90620BUENA PARK 1 250.00 250.0 33.84287 -118.0128 0.00 0.00
91352 1 250.00 250.0 34.23016 -118.3520 0.00 0.00
91355 1 28.00 28.0 34.44003 -118.5915 0.00 0.00
93271THREE RIVERS 2 500.00 250.0 36.43884 -118.9045 0.00 0.00
ACAMPO 72 9991.93 50.0 38.17464 -121.2786 0.01 0.01

A test mapping I performed revealed some locations outside of California, so I decided to look at where all of the points fall by mapping them on a world map as well as a California map.

After mapping all the points on the world map, all of the points outside of California can be clearly seen. I’m assuming the donations that came from outside California came from California residents who are living out of state, but further research into who made the donations would need to be made in order to know for sure.
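A rough way to flag geocoded points that fall outside California without drawing a map is a bounding-box check. The sketch below assumes approximate, rounded latitude and longitude limits for the state; a rectangle also covers parts of neighboring states, so it only catches points that are far away. The data frame `cities` and its third row are made up for illustration.

```r
# Approximate bounding box for California (assumed, rounded values).
ca_box <- list(lat = c(32.5, 42.1), lon = c(-124.5, -114.1))

# Toy geocoded cities; the last row is a made-up out-of-state point.
cities <- data.frame(
  contbr_city = c("ACAMPO", "29 PALMS", "SOMEWHERE ELSE"),
  lat = c(38.17464, 34.13556, 40.71),
  lon = c(-121.2786, -116.0542, -74.01)
)

# TRUE when a point lies inside the rectangle.
in_ca <- with(cities,
  lat >= ca_box$lat[1] & lat <= ca_box$lat[2] &
  lon >= ca_box$lon[1] & lon <= ca_box$lon[2])

# Rows that would need a closer look.
print(cities[!in_ca, ])
```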

Looking at the map of California only, there is a decent scattering of locations all over the state, but there is definitely a clustering of donations coming from the areas surrounding San Francisco, Los Angeles, and San Diego.

Let’s zoom in on those dense areas to get a better idea of the layout in those areas.

contbr_city percent_count
LOS ANGELES 7.82
SAN FRANCISCO 6.96
SAN DIEGO 3.55
OAKLAND 2.58
SAN JOSE 2.35
SACRAMENTO 1.83
BERKELEY 1.82
LONG BEACH 1.20
SANTA MONICA 1.11
PASADENA 0.98

Zooming in on these areas shows that the highest counts in these areas, and in all of California, seem to be in San Francisco, Los Angeles, and San Diego. The table of cities by percent count confirms this. This isn’t unexpected, as they are three of the four most populous cities in California.

After location based on cities, I moved on to explore how occupation affected donations. However, as the graph shows below, the x-axis is overcrowded and some cleaning is needed in order to make the plot readable.

To clean the data, I used information from the United States Census Bureau based on the North American Industry Classification System. Using that chart, I wrote a function to group the various occupations into industry groups. However, the grouping is not exhaustive. There are 25,387 unique occupations listed in the data set, and trying to sort all of them into their respective categories would have taken an inordinate amount of time. As a result, I only included the top 100 occupations by count.
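A grouping function of this kind might be sketched as a lookup from raw occupation strings to category labels, with anything outside the lookup left as `NA`. The four mappings below and the names `occup_lookup` and `categorise_occupation` are assumptions for illustration; the real lookup would cover the top 100 occupations.

```r
# Assumed sample of the occupation-to-category lookup (illustrative only).
occup_lookup <- c(
  "RETIRED"      = "Retired",
  "NOT EMPLOYED" = "Unemployed",
  "ATTORNEY"     = "Professional, Scientific, and Technical Services",
  "TEACHER"      = "Educational Services"
)

# Hypothetical helper: normalise case and whitespace, then look up the category.
# Occupations that are not in the lookup come back as NA.
categorise_occupation <- function(occupation) {
  key <- toupper(trimws(as.character(occupation)))
  unname(occup_lookup[key])
}

categ <- categorise_occupation(c("Teacher", " RETIRED", "ASTRONAUT"))
print(categ)
```

Leaving unmapped occupations as `NA` makes the uncovered remainder easy to count, which matters because the percentages in the tables below are then shares of the categorised donations only.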

contbr_occup_categ count sum median percent_count percent_sum
Retired 215855 21938582.5 30.00 31.72 26.02
Professional, Scientific, and Technical Services 132483 20609591.9 38.00 19.47 24.45
Health Care and Social Assistance 45502 4599496.3 27.00 6.69 5.46
Administrative and Support and Waste Management and Remediation Services 14042 1173176.8 25.00 2.06 1.39
Unemployed 115855 6414819.2 27.00 17.02 7.61
Educational Services 48236 3643095.4 25.00 7.09 4.32
Arts, Entertainment, and Recreation 32978 4741888.6 27.00 4.85 5.62
Management of Companies and Enterprises 34821 11130702.3 50.00 5.12 13.20
Student 7531 914671.8 20.16 1.11 1.08
Real Estate and Rental and Leasing 9668 2127758.5 40.00 1.42 2.52

contbr_occup_categ percent_count
Retired 31.72
Professional, Scientific, and Technical Services 19.47
Unemployed 17.02
Educational Services 7.09
Health Care and Social Assistance 6.69
Management of Companies and Enterprises 5.12
Arts, Entertainment, and Recreation 4.85
Homemaker 2.14
Administrative and Support and Waste Management and Remediation Services 2.06
Real Estate and Rental and Leasing 1.42
Student 1.11
Finance and Insurance 0.49
Transportation and Warehousing 0.36
Information 0.25
Agriculture, Forestry, Fishing and Hunting 0.23

Looking at the graph, it’s easy to see that retirees were by far the most active contributors, followed by people in the “Professional, Scientific, and Technical Services” industry. The most surprising result was “Unemployed” people coming in 3rd place (not that far behind 2nd place, actually). I would assume unemployed people wouldn’t have extra cash lying around to make donations; however, the data seem to show otherwise.

Univariate Analysis

What is the structure of your dataset?

There are 1,125,659 donations with 19 variables. Most of the 19 variables, however, were unimportant to my analysis, so I dropped them. This left me with only 4 variables to work with at the start (cand_nm, contbr_city, contbr_occupation, and contb_receipt_amt).

What is/are the main feature(s) of interest in your dataset?

The main features of interest are donations in terms of the number and amount in dollars, the candidates, the contributors in terms of their occupation, and the location of donations (cand_nm, contb_receipt_amt, contbr_occupation, contbr_city). I want to see how they all interact with each other and which features influenced the count and amount of donations candidates received.

Did you create any new variables from existing variables in the dataset?

Yes, I created a variable for political parties (cand_party) and one for occupation categories (contbr_occup_categ). To reiterate what I stated above with the plots, for political parties I wrote a function to match each candidate to their political party. For occupation categories, I cleaned the data by using information from the United States Census Bureau based on the North American Industry Classification System. Using that chart, I wrote a function to group the various occupations into industry groups. However, the grouping is not exhaustive. There are 25,387 unique occupations listed in the data set, and trying to sort all of them into their respective categories would have taken an inordinate amount of time. As a result, I only included the top 100 occupations by count in “contbr_occupation”, grouping those contributors’ occupations by major industry.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I created a variety of new tables for each variable to get counts of the donations, sums of the donations, median donations, and percentages for counts and sums. I created these tables to get these extra statistics on the variables and to help with making certain plots (particularly for plotting cities on a map to show the geographical distribution of the donations).

Bivariate Plots Section

cand_nm percent_sum
Clinton, Hillary Rodham 61.10
Sanders, Bernard 14.31
Trump, Donald J. 7.24
Cruz, Rafael Edward ‘Ted’ 4.18
Rubio, Marco 3.53
Bush, Jeb 2.41
Carson, Benjamin S. 2.13
Kasich, John R. 1.11
Fiorina, Carly 1.07
Paul, Rand 0.58

It’s a bit hard to see all of the IQRs, so let’s zoom in a bit to get a better look.

When it comes to total donations in dollars, Hillary Clinton is the clear winner with about 61% of donations going to her.

There seems to be quite a variety of IQRs among the candidates. Interestingly, many candidates have median donations of $25, $50, or $100 (which are the three most common donation amounts). Also, most have relatively long IQRs, and a few have quite long IQRs and high median donation amounts. A full breakdown of the numbers is below.

## PCF$cand_nm: Bush, Jeb
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -5400      50     500    1054    2700   10000 
## -------------------------------------------------------- 
## PCF$cand_nm: Carson, Benjamin S.
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -10000.0     25.0     50.0    106.9    100.0  10000.0 
## -------------------------------------------------------- 
## PCF$cand_nm: Christie, Christopher J.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -2700     100    1000    1370    2700    5400 
## -------------------------------------------------------- 
## PCF$cand_nm: Clinton, Hillary Rodham
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5400.0    15.0    25.0   153.1   100.0 10000.0 
## -------------------------------------------------------- 
## PCF$cand_nm: Cruz, Rafael Edward 'Ted'
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -8300.00    25.00    50.00    99.19   100.00 10800.00 
## -------------------------------------------------------- 
## PCF$cand_nm: Fiorina, Carly
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3300.0    25.0   100.0   312.7   250.0  5400.0 
## -------------------------------------------------------- 
## PCF$cand_nm: Gilmore, James S III
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2700    2700    2700    2700    2700    2700 
## -------------------------------------------------------- 
## PCF$cand_nm: Graham, Lindsey O.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -2700     100    1000    1195    2700    8100 
## -------------------------------------------------------- 
## PCF$cand_nm: Huckabee, Mike
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2700.0    25.0    50.0   434.8   500.0  5400.0 
## -------------------------------------------------------- 
## PCF$cand_nm: Jindal, Bobby
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    11.1   250.0   250.0   749.4  1000.0  2700.0 
## -------------------------------------------------------- 
## PCF$cand_nm: Johnson, Gary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -607.9    50.0   100.0   290.0   250.0  2742.0 
## -------------------------------------------------------- 
## PCF$cand_nm: Kasich, John R.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5400.0    50.0   100.0   505.7   500.0  2700.0 
## -------------------------------------------------------- 
## PCF$cand_nm: Lessig, Lawrence
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2700.0    50.0   250.0   500.4   500.0  2700.0 
## -------------------------------------------------------- 
## PCF$cand_nm: McMullin, Evan
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -500.0    25.0   100.0   225.2   250.0  2700.0 
## -------------------------------------------------------- 
## PCF$cand_nm: O'Malley, Martin Joseph
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2700.0    50.0   250.0   750.2  1000.0  5400.0 
## -------------------------------------------------------- 
## PCF$cand_nm: Pataki, George E.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     100     500    1000    1522    2700    2700 
## -------------------------------------------------------- 
## PCF$cand_nm: Paul, Rand
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -5400      25      50     187     100    5400 
## -------------------------------------------------------- 
## PCF$cand_nm: Perry, James R. (Rick)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -2700    1000    2700    1797    2700    2700 
## -------------------------------------------------------- 
## PCF$cand_nm: Rubio, Marco
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5400.0    25.0    75.0   343.6   250.0  5400.0 
## -------------------------------------------------------- 
## PCF$cand_nm: Sanders, Bernard
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -10500.00     15.00     27.00     48.21     50.00  10000.00 
## -------------------------------------------------------- 
## PCF$cand_nm: Santorum, Richard J.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0    25.0    65.0   413.7   262.5  2700.0 
## -------------------------------------------------------- 
## PCF$cand_nm: Stein, Jill
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -300.0    29.0    50.0   128.7   100.0  2700.0 
## -------------------------------------------------------- 
## PCF$cand_nm: Trump, Donald J.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3716.0    28.0    80.0   197.8   200.0  5400.0 
## -------------------------------------------------------- 
## PCF$cand_nm: Walker, Scott
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5400.0   100.0   250.0   676.1  1000.0 10800.0 
## -------------------------------------------------------- 
## PCF$cand_nm: Webb, James Henry Jr.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     5.0   100.0   275.0   722.3   875.0  5400.0
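Per-candidate summaries in the format shown above are what base R's `by()` prints when `summary` is applied within each group. A minimal sketch with made-up amounts follows; the `tapply()` call for the IQR is an addition for illustration.

```r
# Toy data (made-up amounts) standing in for the full PCF data frame.
PCF <- data.frame(
  cand_nm = factor(c("Clinton, Hillary Rodham", "Clinton, Hillary Rodham",
                     "Clinton, Hillary Rodham", "Sanders, Bernard",
                     "Sanders, Bernard")),
  contb_receipt_amt = c(15, 25, 100, 27, 50)
)

# Five-number summary plus mean, one block per candidate.
print(by(PCF$contb_receipt_amt, PCF$cand_nm, summary))

# Interquartile range per candidate (3rd quartile minus 1st quartile).
iqr <- tapply(PCF$contb_receipt_amt, PCF$cand_nm, IQR)
print(iqr)
```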

Looking at donation amounts by candidate, while Hillary Clinton and Bernie Sanders were again the top two, they had two of the lowest median donations at $25 and $27 respectively. This helps solve the mystery of $27 being one of the most common donation amounts. Since Bernie had the second most donations by count, it makes sense that his median donation would be one of the most common amounts. Why $27 was such a popular amount to give Bernie remains a mystery, however.

cand_party percent_sum
Democrat 75.82
Republican 23.64
Libertarian 0.34
Green 0.18
Independent 0.03

We can see that Democrats and Republicans received some of the highest donation amounts, but we can’t really make out the IQRs. Let’s zoom in to get a closer look.

## PCF$cand_party: Democrat
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -10500.0     15.0     27.0    108.8     67.0  10000.0 
## -------------------------------------------------------- 
## PCF$cand_party: Republican
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -10000.0     25.0     50.0    194.4    100.0  10800.0 
## -------------------------------------------------------- 
## PCF$cand_party: Libertarian
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -607.9    50.0   100.0   290.0   250.0  2742.0 
## -------------------------------------------------------- 
## PCF$cand_party: Green
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -300.0    29.0    50.0   128.7   100.0  2700.0 
## -------------------------------------------------------- 
## PCF$cand_party: Independent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -500.0    25.0   100.0   225.2   250.0  2700.0

The results for political party are definitely not a surprise given how much more money Hillary and Bernie received individually (especially Hillary), although Democrats took a smaller share of total donations in dollars (about 76%) than by count (about 85%). However, while Democratic candidates received the most overall, they had the lowest median donation of any party. Does this mean that Democratic donors tend to be poorer? That they have less money to spare than the others? More information would be needed to answer these questions. Also, the quartiles of the other parties tend to fall on more regular amounts like $25, $50, and $100. Perhaps Democratic donors were just more likely to use the “Other” amount box instead of choosing one of the preset suggested donation amounts that are often offered.

Regardless, in order to find out if there is a statistically significant difference between the political parties, I will conduct a Kruskal-Wallis rank sum test.

## 
##  Kruskal-Wallis rank sum test
## 
## data:  contb_receipt_amt by cand_party
## Kruskal-Wallis chi-squared = 35874, df = 4, p-value < 2.2e-16

The results above show that, with a p-value well below 0.05, there is indeed a statistically significant difference in donation amounts among the political parties.
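For reference, a Kruskal-Wallis rank sum test is available in base R as `kruskal.test()`. The sketch below runs it on a small made-up data frame (`toy`), so its statistic and p-value are unrelated to the results reported above.

```r
# Toy data (made-up amounts): three parties, six donations each.
toy <- data.frame(
  cand_party = factor(rep(c("Democrat", "Republican", "Libertarian"), each = 6)),
  contb_receipt_amt = c(5, 10, 15, 25, 27, 50,
                        25, 50, 50, 100, 250, 500,
                        50, 100, 100, 250, 500, 1000)
)

# Rank-based test of whether the amounts come from the same distribution
# across parties; it does not assume normally distributed amounts.
kw <- kruskal.test(contb_receipt_amt ~ cand_party, data = toy)
print(kw)

# Degrees of freedom are the number of groups minus one.
print(kw$parameter)
```

A rank-based test suits donation amounts because they are heavily skewed, with many small gifts and a few very large ones.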

However, this doesn’t tell us much about which parties differ from each other, so I will run a post-hoc analysis to determine this.

(NOTE: Groups sharing the same letter are not significantly different.)
Group Letter MonoLetter
Democrat a a
Republican b b
Libertarian c c
Green d d
Independent cd cd

After running the post-hoc analysis using the Dunn test, it is easy to see that almost all of the parties differ significantly from each other. The exception is Independents, who do not differ significantly from either Libertarians or Greens.
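The Dunn test itself is not part of base R. As a stand-in that needs no extra packages, the sketch below uses a different post-hoc technique, pairwise Wilcoxon rank sum tests with Holm-adjusted p-values via `pairwise.wilcox.test()`. The data frame `toy` is made up, so the resulting p-values are illustrative only.

```r
# Toy data (made-up amounts): three parties, six donations each.
toy <- data.frame(
  cand_party = factor(rep(c("Democrat", "Republican", "Libertarian"), each = 6)),
  contb_receipt_amt = c(5, 10, 15, 25, 27, 50,
                        25, 50, 50, 100, 250, 500,
                        50, 100, 100, 250, 500, 1000)
)

# Pairwise rank sum tests with Holm correction for multiple comparisons.
# exact = FALSE uses the normal approximation, since the data contain ties.
pw <- pairwise.wilcox.test(toy$contb_receipt_amt, toy$cand_party,
                           p.adjust.method = "holm", exact = FALSE)

# Lower-triangular matrix of adjusted p-values, one cell per pair of parties.
print(pw$p.value)
```

Pairs whose adjusted p-value is above the chosen threshold would share a letter in a compact letter display like the one above.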

contbr_city percent_sum
LOS ANGELES 10.95
SAN FRANCISCO 10.19
SAN DIEGO 2.48
PALO ALTO 2.20
BEVERLY HILLS 2.17
OAKLAND 2.06
SANTA MONICA 1.94
BERKELEY 1.90
SACRAMENTO 1.56
SAN JOSE 1.55

In general, these maps based on location and contribution totals in dollars mirror the maps from earlier based on location and contribution totals by count. However, there are some exceptions. Looking at the raw numbers in the table, Beverly Hills and Palo Alto have moved up the list for total contributions in dollars, I assume because of these areas’ affluence.

As I did above with political parties, in order to find out if there is a statistically significant difference between the cities, I will conduct a Kruskal-Wallis rank sum test.

## 
##  Kruskal-Wallis rank sum test
## 
## data:  contb_receipt_amt by contbr_city
## Kruskal-Wallis chi-squared = 44295, df = 2117, p-value < 2.2e-16

The results above show that, with a p-value well below 0.05, there is indeed a statistically significant difference in donation amounts among cities.

contbr_occup_categ percent_sum
Retired 26.02
Professional, Scientific, and Technical Services 24.45
Management of Companies and Enterprises 13.20
Unemployed 7.61
Arts, Entertainment, and Recreation 5.62
Health Care and Social Assistance 5.46
Homemaker 5.39
Educational Services 4.32
Real Estate and Rental and Leasing 2.52
Finance and Insurance 2.16
Administrative and Support and Waste Management and Remediation Services 1.39
Student 1.08
Agriculture, Forestry, Fishing and Hunting 0.51
Information 0.15
Transportation and Warehousing 0.11

Comparing each occupation category’s share of contribution dollars with its share of contribution count, seven of the categories increased (with the biggest increases in the “Management of Companies and Enterprises” and “Professional, Scientific, and Technical Services” categories) and the other eight decreased (with the biggest decrease in the “Unemployed” category). Retirees, “Professional, Scientific, and Technical Services”, and “Management of Companies and Enterprises” gave the most as groups.

## 
##  Kruskal-Wallis rank sum test
## 
## data:  contb_receipt_amt by contbr_occup_categ
## Kruskal-Wallis chi-squared = 17346, df = 14, p-value < 2.2e-16

The results above show that, with a p-value well below 0.05, there is indeed a statistically significant difference in donation amounts among occupation categories.

(NOTE: Groups sharing the same letter are not significantly different.)
Group Letter MonoLetter
Retired a a
Professional,Scientific,andTechnicalServices b b
HealthCareandSocialAssistance a a
AdministrativeandSupportandWasteManagementandRemediationServices c c
Unemployed c c
EducationalServices d d
Arts,Entertainment,andRecreation e e
ManagementofCompaniesandEnterprises f f
Student g g
RealEstateandRentalandLeasing h h
Homemaker h h
TransportationandWarehousing i i
FinanceandInsurance j j
Information e e
Agriculture,Forestry,FishingandHunting k k

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

When looking at the individual candidates there doesn’t seem to be any general relationship at all, but once grouped into political parties there seems to be a relationship between party (especially the two major parties) and the amount of donations in dollars. The two major parties (Democrats and Republicans) had the highest donation totals, with Democrats being the clear winner. However, even though Democrats had the highest donation total overall, they also had the lowest median donation amount at $27. Not only did the most people donate to Democrats and give them the most money, but they did so in relatively modest amounts per donation.

In addition, there seems to be a relationship between cities and the amount of donations given, with big cities (specifically San Francisco, Los Angeles, and San Diego) and the immediate surrounding area giving a higher amount of money than the countryside.

Finally, looking at donations by occupation showed that a few groups (specifically retirees, “Professional, Scientific, and Technical Services”, and “Management of Companies and Enterprises”) gave in amounts well above all the other groups.

What was the strongest relationship you found?

From what I can see, no single relationship was clearly stronger than the others. A Kruskal-Wallis rank sum test on each of the three variables (political party, location by city, and occupation) showed a statistically significant relationship with donation amounts (all with p-values < 2.2e-16), with certain political parties, cities, and occupations receiving or giving far more donations (in both count and sum) than the others.

Multivariate Plots Section

Hmm, this graph seems a bit convoluted with all 25 candidates and 15 occupation categories together in one graph. Probably best to take a look at this by political party instead.

contbr_occup_categ cand_nm cand_party count sum median
Retired Bush, Jeb Republican 996 451364.00 50
Retired Carson, Benjamin S. Republican 13692 1168789.56 50
Retired Christie, Christopher J. Republican 38 29065.00 100
Retired Clinton, Hillary Rodham Democrat 129408 12342944.22 25
Retired Cruz, Rafael Edward ‘Ted’ Republican 23195 1715291.66 50
Retired Fiorina, Carly Republican 1971 343295.47 50
Retired Graham, Lindsey O. Republican 80 61645.00 250
Retired Huckabee, Mike Republican 206 49290.50 50
Retired Jindal, Bobby Republican 5 2250.00 250
Retired Johnson, Gary Libertarian 233 53835.55 100

With Democrats getting so much more money than everyone else, it’s hard to see the distributions of the rest of the parties. Let’s zoom in to get a better look.

With all the high-value outliers it’s hard to see the IQRs of all the parties. Let’s zoom in to get a better look.
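
One way to choose the zoomed range is to limit the axis to the box-plot whiskers (1.5 times the IQR beyond the quartiles). The sketch below is only an illustration in plain Python: the function name `zoom_limits` and the sample amounts are made up, and the report does not say how its zoom limits were actually chosen.

```python
from statistics import quantiles


def zoom_limits(amounts, whisker=1.5):
    """Return (lower, upper) axis limits that keep the box and whiskers in view."""
    # Quartiles using linear interpolation over the full range of the data.
    q1, _, q3 = quantiles(amounts, n=4, method="inclusive")
    iqr = q3 - q1
    # Never extend the limits past the data itself.
    lower = max(min(amounts), q1 - whisker * iqr)
    upper = min(max(amounts), q3 + whisker * iqr)
    return lower, upper


# Made-up donation amounts with one high-value outlier, illustration only.
sample_amounts = [5, 10, 25, 25, 50, 100, 2700]
limits = zoom_limits(sample_amounts)
print(limits)
```

The limits are meant for the plot axis only. The outliers should stay in the data so that the medians and quartiles drawn in the zoomed plot are unchanged.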

For these plots I decided to look at political parties and how much money was given to each party by occupation. I looked at donations in two different ways: first by the sum of donations and second by the individual donation amounts. From the plots you can see that Democrats received more donations by sum in every occupation except “Agriculture, Forestry, Fishing and Hunting” (where Republicans received more). In terms of donation amounts, Democrats had the most consistent median value across all occupations. The further right a party sits on the plot, the more its median values vary (with Independents seeming to vary the most widely). Perhaps this is due to the number of people who donated being lower for these parties.

contbr_city        cand_party  count      sum  median       lat        lon
(blank)            Democrat       25  5725.00   100.0  36.77826  -119.4179
.                  Republican      1     2.40     2.4  36.77826  -119.4179
1000 OAKS          Republican      1   100.00   100.0  34.17056  -118.8376
29 PALMS           Democrat       97  2900.52    27.0  34.13556  -116.0542
29 PALMS           Republican      3    90.00    30.0  34.13556  -116.0542
-4086              Republican      1    40.00    40.0  37.38580  -121.9731
90620BUENA PARK    Republican      1   250.00   250.0  33.84287  -118.0128
91352              Republican      1   250.00   250.0  34.23016  -118.3520
91355              Republican      1    28.00    28.0  34.44003  -118.5915
93271THREE RIVERS  Republican      2   500.00   250.0  36.43884  -118.9045

Looking at these maps, while some towns where Democratic donors dominated can be seen spread throughout California, the majority of them seem to be concentrated in and around major cities. The other parties (Republicans especially) are much more spread out across California. In the cities, Democrats had a clear dominance in terms of the amount of money donated.
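
To make “dominated” concrete, one reading is that a party dominates a city when it has the largest donation sum there. The sketch below applies that reading to a few rows copied from the city and party summary table above. It is a plain-Python illustration under that assumption: the function name `dominant_party_by_city` is hypothetical, and the maps in the report may have been built differently.

```python
def dominant_party_by_city(summary_rows):
    """Map each city to the party with the largest donation sum in that city."""
    best = {}
    for row in summary_rows:
        city = row["contbr_city"]
        # Keep the row with the highest sum seen so far for this city.
        if city not in best or row["sum"] > best[city]["sum"]:
            best[city] = row
    return {city: row["cand_party"] for city, row in best.items()}


# Rows copied from the summary table above (city, party, donation sum).
city_party_rows = [
    {"contbr_city": "1000 OAKS", "cand_party": "Republican", "sum": 100.00},
    {"contbr_city": "29 PALMS", "cand_party": "Democrat", "sum": 2900.52},
    {"contbr_city": "29 PALMS", "cand_party": "Republican", "sum": 90.00},
    {"contbr_city": "93271THREE RIVERS", "cand_party": "Republican", "sum": 500.00},
]

dominant = dominant_party_by_city(city_party_rows)
print(dominant)
```

Using the donation count or the median instead of the sum would only require changing the field that is compared.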

contbr_city      contbr_occup_categ                       count      sum  median       lat        lon
(blank)          Retired                                     12  2750.00   250.0  36.77826  -119.4179
(blank)          Unemployed                                  10    70.00     5.0  36.77826  -119.4179
(blank)          Management of Companies and Enterprises      1  2700.00  2700.0  36.77826  -119.4179
(blank)          NA                                           2   205.00   102.5  36.77826  -119.4179
.                NA                                           1     2.40     2.4  36.77826  -119.4179
1000 OAKS        Transportation and Warehousing               1   100.00   100.0  34.17056  -118.8376
29 PALMS         Unemployed                                   4    95.00    25.0  34.13556  -116.0542
29 PALMS         NA                                          96  2895.52    27.0  34.13556  -116.0542
-4086            Retired                                      1    40.00    40.0  37.38580  -121.9731
90620BUENA PARK  NA                                           1   250.00   250.0  33.84287  -118.0128

From what I can see, there seems to be a wider variety of jobs clustered around the cities. The further out into the countryside you go, the more “Educational Services” and “Agriculture, Forestry, Fishing and Hunting” jobs seem to dominate in terms of donations.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Democrats received more donations by sum in every occupation except “Agriculture, Forestry, Fishing and Hunting” (where Republicans received more). In terms of donation amounts, Democrats had the most consistent median value across all occupations at around $27. The further left a political party sits on the plot, the more its median values vary (with Independents seeming to vary the most widely). Perhaps this is due to the number of people who donated being lower for these parties.

Democratic donors are less spread out and are concentrated in and around major cities. The other parties (Republicans especially) are much more spread out across California. In the cities, Democrats had a clear dominance in terms of the amount of money donated.

It can be hard to distinguish between some of the colors for the occupations, but certain occupations seem to be clustered in particular regions. The occupations “Student”, “Real Estate and Rental and Leasing”, and “Homemaker” cluster in and around Los Angeles (likely due to the high number of universities, real estate opportunities, and families in this area), while the occupation category “Agriculture, Forestry, Fishing, and Hunting” clusters down the center of California (the Central Valley, where a lot of the state’s agriculture is done).


Final Plots and Summary

Plot One

Description One

Looking at the total donations for political parties, the plot shows that Democrats received far more in donations than any other political party.

Plot Two

Description Two

Median donations are fairly uniform across occupations, with most falling between $25 and $50.

Plot Three

Description Three

These plots show the distribution of donations made to the various political parties by city. While Democratic donors may have given more money overall, most of that came from larger cities. The other parties (Republicans especially) are much more spread out.


Reflection

This report explores a data set containing financial contributions made by California residents to Presidential candidates in the 2016 Presidential election. While the data set contained 19 variables, I whittled it down to only the 4 variables I thought were useful to analyze.

In general, this analysis seems to confirm a lot of my preconceived notions about the political landscape in California. After running the analysis, the numbers show that California residents overall, and especially those in bigger cities, lean liberal and vote for and support Democrats. The Democratic Party beat the other parties in all the aspects I looked at. Overall, they had the most people donate to them and the most money donated to them, and when broken down by occupation they won the support of every occupational category in both count and dollar amount (with the exception of the “Agriculture, Forestry, Fishing, and Hunting” category). With all the surprises in this election, however, it would be interesting to run the same analysis on past elections and see how well the trends from this election hold.

There are some limitations in my analysis when it comes to occupations. As I mentioned briefly earlier in my analysis, there are 25,387 unique occupations listed in the data set, and trying to sort all of them into their respective categories would have taken an inordinate amount of time. As a result I only included the top 100 occupations. If I had included all of the occupations, the numbers might be a bit different. Perhaps the remaining donations would have gone to Republicans or other candidates, and the Democrats might not have had quite the same dominance across the board. Then again, perhaps nothing significant would have happened. However, because of this limitation the results for donations by occupation should be taken with a grain of salt.
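
One way to gauge how much this limitation matters is to measure what share of all donations the top occupations cover. The sketch below assumes that “top” means most frequent by donation count, which the report does not state explicitly. The function name `top_n_coverage` and the sample occupation strings are hypothetical, and the code is a plain-Python illustration rather than the method used here.

```python
from collections import Counter


def top_n_coverage(occupations, n):
    """Return the n most frequent occupations and the share of rows they cover."""
    counts = Counter(occupations)
    top = counts.most_common(n)
    covered = sum(count for _, count in top)
    return [name for name, _ in top], covered / len(occupations)


# Made-up occupation values, one per donation row, for illustration only.
sample_occupations = (
    ["RETIRED"] * 5
    + ["NOT EMPLOYED"] * 3
    + ["TEACHER"] * 2
    + ["ATTORNEY", "ENGINEER"]
)

top_names, share = top_n_coverage(sample_occupations, 2)
print(top_names, f"{share:.1%} of donations covered")
```

If the top 100 occupations cover most of the donation rows, leaving the rest uncategorized is less likely to change the party-by-occupation picture. A low share would make the caveat above more important.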


Citations